Attrition Case Study for Frito Lay

A case study into the work force data set for MSDS 6306 by Renu Karthikeyan

Link to Attrition Shiny App

CaseStudy2 Attrition

Statement of Purpose:

I am pleased to present this case study on Frito Lay attrition, which aims to analyze and derive actionable insights from the workforce data at Frito Lay. The purpose of this study is to understand the factors influencing employee attrition, develop predictive models for attrition risk, and evaluate the performance of the predictive models.

Objectives:

  1. Data Exploration and Visualization: Conduct a thorough exploration of the available workforce data to gain insights into the distribution of various attributes. Visualize key trends and patterns related to attrition, employee demographics, and other relevant factors.
  2. Predictive Modeling: Use K-nearest neighbors (KNN) and Naive Bayes to build predictive models for employee attrition. Evaluate the performance of each model and identify the most effective approach for predicting attrition risk. Create a linear regression model to predict employee salary, given all other predictors.
  3. Shiny App Development: Create an interactive Shiny app to visualize and communicate the insights derived from the analysis. Provide a user-friendly platform for stakeholders to explore the data and understand the implications of the findings.
  4. Communication and Collaboration: Effectively communicate the results and recommendations to stakeholders through clear and concise reports and presentations.
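Objective 2's classification workflow can be sketched in miniature. The snippet below is a hedged illustration, not the study's actual code: it uses the built-in iris data as a stand-in for the attrition variables, and k = 5 is an arbitrary starting value.

```r
library(class)   # knn()
library(e1071)   # naiveBayes()

set.seed(6306)
# Stand-in data: iris, with Species playing the role of the class label
idx   <- sample(seq_len(nrow(iris)), 0.7 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

# KNN on the numeric predictors
knn_pred <- knn(train[, 1:4], test[, 1:4], cl = train$Species, k = 5)

# Naive Bayes on the same split
nb_model <- naiveBayes(Species ~ ., data = train)
nb_pred  <- predict(nb_model, test)

# Compare the accuracy of the two models on the held-out set
mean(knn_pred == test$Species)
mean(nb_pred == test$Species)
```

In the study itself, accuracy alone is not enough: sensitivity and specificity on the Attrition class matter as well, which is why the model evaluation below uses confusion matrices.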

Load and Install Libraries

library(dplyr)
library(ggplot2)
library(caret)
library(aws.s3)
library(RCurl)
library(readr)
library(base)
library(tidyverse)
library(naniar)
library(class)
library(GGally)
library(e1071)
library(car)
library(fastDummies)

Set AWS credentials, and Load in Datasets from Amazon S3 Bucket

Here, I load the data from the AWS S3 bucket. I did some slight cleanup: I excluded the “Over18” column from the original data set and the Attrition test data set, and the column name for ID in the No Salary data set did not match the others, so I adjusted it. I also checked for missing values. There are no missing values in any of the data sets, so there is no need for imputation or deletion of rows.

# Load the Attrition data set from S3
#s3_path <- "s3://msds.ds.6306.2/CaseStudy2-data.csv"
# Read the Attrition Data CSV file from S3
Attritiondata <- read.table(textConnection(getURL("https://msdsds6306.s3.us-east-2.amazonaws.com/CaseStudy2-data.csv")), sep = ",", header = TRUE)
head(Attritiondata,5)
##   ID Age Attrition    BusinessTravel DailyRate             Department
## 1  1  32        No     Travel_Rarely       117                  Sales
## 2  2  40        No     Travel_Rarely      1308 Research & Development
## 3  3  35        No Travel_Frequently       200 Research & Development
## 4  4  32        No     Travel_Rarely       801                  Sales
## 5  5  24        No Travel_Frequently       567 Research & Development
##   DistanceFromHome Education   EducationField EmployeeCount EmployeeNumber
## 1               13         4    Life Sciences             1            859
## 2               14         3          Medical             1           1128
## 3               18         2    Life Sciences             1           1412
## 4                1         4        Marketing             1           2016
## 5                2         1 Technical Degree             1           1646
##   EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1                       2   Male         73              3        2
## 2                       3   Male         44              2        5
## 3                       3   Male         60              3        3
## 4                       3 Female         48              3        3
## 5                       1 Female         32              3        1
##                  JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 1        Sales Executive               4      Divorced          4403
## 2      Research Director               3        Single         19626
## 3 Manufacturing Director               4        Single          9362
## 4        Sales Executive               4       Married         10422
## 5     Research Scientist               4        Single          3760
##   MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike
## 1        9250                  2      Y       No                11
## 2       17544                  1      Y       No                14
## 3       19944                  2      Y       No                11
## 4       24032                  1      Y       No                19
## 5       17218                  1      Y      Yes                13
##   PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
## 1                 3                        3            80                1
## 2                 3                        1            80                0
## 3                 3                        3            80                0
## 4                 3                        3            80                2
## 5                 3                        3            80                0
##   TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## 1                 8                     3               2              5
## 2                21                     2               4             20
## 3                10                     2               3              2
## 4                14                     3               3             14
## 5                 6                     2               3              6
##   YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## 1                  2                       0                    3
## 2                  7                       4                    9
## 3                  2                       2                    2
## 4                 10                       5                    7
## 5                  3                       1                    3
#summary(Attritiondata)
vis_miss(Attritiondata) #checking for no missing values

#Read CSV "testing" files from S3

# Reading in of NoSalary Dataset from S3 Bucket
NoSalary <- read.table(textConnection(getURL("https://msdsds6306.s3.us-east-2.amazonaws.com/CaseStudy2CompSet+No+Salary.csv")), sep = ",", header = TRUE)
head(NoSalary,5)
##   ï..ID Age Attrition    BusinessTravel DailyRate             Department
## 1   871  43        No Travel_Frequently      1422                  Sales
## 2   872  33        No     Travel_Rarely       461 Research & Development
## 3   873  55       Yes     Travel_Rarely       267                  Sales
## 4   874  36        No        Non-Travel      1351 Research & Development
## 5   875  27        No     Travel_Rarely      1302 Research & Development
##   DistanceFromHome Education EducationField EmployeeCount EmployeeNumber
## 1                2         4  Life Sciences             1           1849
## 2               13         1  Life Sciences             1            995
## 3               13         4      Marketing             1           1372
## 4                9         4  Life Sciences             1           1949
## 5               19         3          Other             1           1619
##   EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1                       1   Male         92              3        2
## 2                       2 Female         53              3        1
## 3                       1   Male         85              4        4
## 4                       1   Male         66              4        1
## 5                       4   Male         67              2        1
##                 JobRole JobSatisfaction MaritalStatus MonthlyRate
## 1       Sales Executive               4       Married       19246
## 2    Research Scientist               4        Single       17241
## 3       Sales Executive               3        Single        9277
## 4 Laboratory Technician               2       Married        9238
## 5 Laboratory Technician               1      Divorced       16290
##   NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating
## 1                  1      Y       No                20                 4
## 2                  3      Y       No                18                 3
## 3                  6      Y      Yes                17                 3
## 4                  1      Y       No                22                 4
## 5                  1      Y       No                11                 3
##   RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
## 1                        3            80                1                 7
## 2                        1            80                0                 5
## 3                        3            80                0                24
## 4                        2            80                0                 5
## 5                        1            80                2                 7
##   TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 1                     5               3              7                  7
## 2                     4               3              3                  2
## 3                     2               2             19                  7
## 4                     3               3              5                  4
## 5                     3               3              7                  7
##   YearsSinceLastPromotion YearsWithCurrManager
## 1                       7                    7
## 2                       0                    2
## 3                       3                    8
## 4                       0                    2
## 5                       0                    7
#summary(NoSalary)
vis_miss(NoSalary)

AttritionTest <- read.table(textConnection(getURL("https://msdsds6306.s3.us-east-2.amazonaws.com/CaseStudy2CompSet+No+Attrition.csv")), sep = ",", header = TRUE)
head(AttritionTest,5)
##     ID Age BusinessTravel DailyRate             Department DistanceFromHome
## 1 1171  35  Travel_Rarely       750 Research & Development               28
## 2 1172  33  Travel_Rarely       147        Human Resources                2
## 3 1173  26  Travel_Rarely      1330 Research & Development               21
## 4 1174  55  Travel_Rarely      1311 Research & Development                2
## 5 1175  29  Travel_Rarely      1246                  Sales               19
##   Education  EducationField EmployeeCount EmployeeNumber
## 1         3   Life Sciences             1           1596
## 2         3 Human Resources             1           1207
## 3         3         Medical             1           1107
## 4         3   Life Sciences             1            505
## 5         3   Life Sciences             1           1497
##   EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1                       2   Male         46              4        2
## 2                       2   Male         99              3        1
## 3                       1   Male         37              3        1
## 4                       3 Female         97              3        4
## 5                       3   Male         77              2        2
##                 JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
## 1 Laboratory Technician               3       Married          3407       25348
## 2       Human Resources               3       Married          3600        8429
## 3 Laboratory Technician               3      Divorced          2377       19373
## 4               Manager               4        Single         16659       23258
## 5       Sales Executive               3      Divorced          8620       23757
##   NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating
## 1                  1      Y       No                17                 3
## 2                  1      Y       No                13                 3
## 3                  1      Y       No                20                 4
## 4                  2      Y      Yes                13                 3
## 5                  1      Y       No                14                 3
##   RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
## 1                        4            80                2                10
## 2                        4            80                1                 5
## 3                        3            80                1                 1
## 4                        3            80                0                30
## 5                        3            80                2                10
##   TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 1                     3               2             10                  9
## 2                     2               3              5                  4
## 3                     0               2              1                  1
## 4                     2               3              5                  4
## 5                     3               3             10                  7
##   YearsSinceLastPromotion YearsWithCurrManager
## 1                       6                    8
## 2                       1                    4
## 3                       0                    0
## 4                       1                    2
## 5                       0                    4
#summary(AttritionTest)
vis_miss(AttritionTest)

Attritiondata <- subset(Attritiondata, select = -c(Over18))
head(Attritiondata,5)
##   ID Age Attrition    BusinessTravel DailyRate             Department
## 1  1  32        No     Travel_Rarely       117                  Sales
## 2  2  40        No     Travel_Rarely      1308 Research & Development
## 3  3  35        No Travel_Frequently       200 Research & Development
## 4  4  32        No     Travel_Rarely       801                  Sales
## 5  5  24        No Travel_Frequently       567 Research & Development
##   DistanceFromHome Education   EducationField EmployeeCount EmployeeNumber
## 1               13         4    Life Sciences             1            859
## 2               14         3          Medical             1           1128
## 3               18         2    Life Sciences             1           1412
## 4                1         4        Marketing             1           2016
## 5                2         1 Technical Degree             1           1646
##   EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1                       2   Male         73              3        2
## 2                       3   Male         44              2        5
## 3                       3   Male         60              3        3
## 4                       3 Female         48              3        3
## 5                       1 Female         32              3        1
##                  JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 1        Sales Executive               4      Divorced          4403
## 2      Research Director               3        Single         19626
## 3 Manufacturing Director               4        Single          9362
## 4        Sales Executive               4       Married         10422
## 5     Research Scientist               4        Single          3760
##   MonthlyRate NumCompaniesWorked OverTime PercentSalaryHike PerformanceRating
## 1        9250                  2       No                11                 3
## 2       17544                  1       No                14                 3
## 3       19944                  2       No                11                 3
## 4       24032                  1       No                19                 3
## 5       17218                  1      Yes                13                 3
##   RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
## 1                        3            80                1                 8
## 2                        1            80                0                21
## 3                        3            80                0                10
## 4                        3            80                2                14
## 5                        3            80                0                 6
##   TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 1                     3               2              5                  2
## 2                     2               4             20                  7
## 3                     2               3              2                  2
## 4                     3               3             14                 10
## 5                     2               3              6                  3
##   YearsSinceLastPromotion YearsWithCurrManager
## 1                       0                    3
## 2                       4                    9
## 3                       2                    2
## 4                       5                    7
## 5                       1                    3
colnames(NoSalary)[colnames(NoSalary)=="ï..ID"] <- "ID"
colnames(NoSalary)
##  [1] "ID"                       "Age"                     
##  [3] "Attrition"                "BusinessTravel"          
##  [5] "DailyRate"                "Department"              
##  [7] "DistanceFromHome"         "Education"               
##  [9] "EducationField"           "EmployeeCount"           
## [11] "EmployeeNumber"           "EnvironmentSatisfaction" 
## [13] "Gender"                   "HourlyRate"              
## [15] "JobInvolvement"           "JobLevel"                
## [17] "JobRole"                  "JobSatisfaction"         
## [19] "MaritalStatus"            "MonthlyRate"             
## [21] "NumCompaniesWorked"       "Over18"                  
## [23] "OverTime"                 "PercentSalaryHike"       
## [25] "PerformanceRating"        "RelationshipSatisfaction"
## [27] "StandardHours"            "StockOptionLevel"        
## [29] "TotalWorkingYears"        "TrainingTimesLastYear"   
## [31] "WorkLifeBalance"          "YearsAtCompany"          
## [33] "YearsInCurrentRole"       "YearsSinceLastPromotion" 
## [35] "YearsWithCurrManager"
AttritionTest <- subset(AttritionTest, select = -c(Over18))
head(AttritionTest,5)
##     ID Age BusinessTravel DailyRate             Department DistanceFromHome
## 1 1171  35  Travel_Rarely       750 Research & Development               28
## 2 1172  33  Travel_Rarely       147        Human Resources                2
## 3 1173  26  Travel_Rarely      1330 Research & Development               21
## 4 1174  55  Travel_Rarely      1311 Research & Development                2
## 5 1175  29  Travel_Rarely      1246                  Sales               19
##   Education  EducationField EmployeeCount EmployeeNumber
## 1         3   Life Sciences             1           1596
## 2         3 Human Resources             1           1207
## 3         3         Medical             1           1107
## 4         3   Life Sciences             1            505
## 5         3   Life Sciences             1           1497
##   EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1                       2   Male         46              4        2
## 2                       2   Male         99              3        1
## 3                       1   Male         37              3        1
## 4                       3 Female         97              3        4
## 5                       3   Male         77              2        2
##                 JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
## 1 Laboratory Technician               3       Married          3407       25348
## 2       Human Resources               3       Married          3600        8429
## 3 Laboratory Technician               3      Divorced          2377       19373
## 4               Manager               4        Single         16659       23258
## 5       Sales Executive               3      Divorced          8620       23757
##   NumCompaniesWorked OverTime PercentSalaryHike PerformanceRating
## 1                  1       No                17                 3
## 2                  1       No                13                 3
## 3                  1       No                20                 4
## 4                  2      Yes                13                 3
## 5                  1       No                14                 3
##   RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
## 1                        4            80                2                10
## 2                        4            80                1                 5
## 3                        3            80                1                 1
## 4                        3            80                0                30
## 5                        3            80                2                10
##   TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 1                     3               2             10                  9
## 2                     2               3              5                  4
## 3                     0               2              1                  1
## 4                     2               3              5                  4
## 5                     3               3             10                  7
##   YearsSinceLastPromotion YearsWithCurrManager
## 1                       6                    8
## 2                       1                    4
## 3                       0                    0
## 4                       1                    2
## 5                       0                    4

EDA

Looking at relationships within the overall data set, not just among employees who attrited.

ggplot(data=Attritiondata, aes(x=JobSatisfaction)) +geom_bar(position="dodge") + theme_minimal() + ggtitle("Overall Job Satisfaction")

Here, we look at overall Job Satisfaction among the employees. The majority of employees seem to be satisfied with their jobs, with a handful less satisfied. The histogram is left-skewed.

ggplot(data=Attritiondata,aes(x=MonthlyIncome, y=Age)) + geom_point(position="jitter") + facet_wrap(~MaritalStatus)+geom_smooth(method="loess") + ggtitle("Monthly Income and Age categorized by Marital Status")

Here, we see Monthly Income by Age, categorized by Marital Status. Divorced and Married employees tend to make more as their age increases. The same positive trend holds for Single employees, but they do not make as much as their Married or Divorced coworkers. Overall, there is a clear positive trend between age and monthly income.

ggplot(data = Attritiondata, aes(x = MonthlyIncome, y = Age, color = JobInvolvement)) +
  geom_point(position = "jitter") +
  geom_smooth(method = lm) +
  ggtitle("Job Involvement and Monthly Income") + facet_wrap(~JobInvolvement)

Looking at Job Involvement and Monthly Income, it looks like most employees, regardless of Job Involvement, show a positive correlation between Age and Monthly Income. I was trying to see whether those who were more involved in their job earned a higher income, but that does not seem to be the case at first glance. Many employees are fairly involved in their jobs (Job Involvement levels 2 and 3).

ggplot(data = Attritiondata, aes(x = MonthlyIncome, y = Age, color = interaction(JobInvolvement))) +
  geom_point(position = "jitter") +
  geom_smooth(method = lm, se = FALSE) +
  ggtitle("Job Involvement and Monthly Income")

When I plotted all four Job Involvement levels with their respective linear regression lines, those with the least job involvement (level 1) start to make more after crossing 40 years old and a monthly income of 10,000. Level 3 employees make the most early on, up to a monthly income of 10,000 and roughly 40 years old, then make the least as they near 50 years old. The level 2 and 4 regression lines fall between the other two lines initially; toward the end they are very close to each other and seem to overlap.

ggplot(data=Attritiondata,aes(x=MonthlyIncome,y=NumCompaniesWorked)) + geom_point(position="jitter")  + geom_smooth(method=lm) + ggtitle("Number of Companies Worked and Monthly Income")

There is a positive correlation between Monthly Income and Number of Companies Worked.

ggplot(data = Attritiondata, aes(x = MonthlyIncome)) + geom_histogram()  + ggtitle("Monthly Income Histogram")

Looking at a histogram of Monthly Income, it appears right-skewed: the mode is less than the median, which is less than the mean.
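The mean/median ordering is a quick numeric check of skew direction. The sketch below illustrates it on simulated log-normal data (a hypothetical stand-in for MonthlyIncome, not the study's data):

```r
# For right-skewed data, the mean exceeds the median.
# Simulated log-normal "income" as a stand-in for MonthlyIncome.
set.seed(6306)
income <- rlnorm(1000, meanlog = 9, sdlog = 0.5)
mean(income) > median(income)   # TRUE for right-skewed data
```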

ggplot(data=Attritiondata,aes(x=MonthlyIncome)) + geom_histogram() + ggtitle("Monthly Income Histogram Categorized by Gender") + facet_wrap(~Gender)

Looking at Monthly Income categorized by Gender, there are more male data points in the data set than female ones. The mode is higher for men than for women. Both histograms are right-skewed.

ggplot(data=Attritiondata,aes(x=MonthlyIncome, y=DistanceFromHome)) + geom_point(position="jitter") + geom_smooth(method=loess) + ggtitle("Monthly Income and Distance from Home")

There seems to be a negative overall correlation between distance from home and monthly income. The plot and loess curve suggest that distance from home initially increases with monthly income, but once monthly income exceeds 15,000, distance from home decreases overall. The curve is roughly concave, with a slight downward ‘w’ shape. Looking at this graph, I would interpret distance from home to be in miles; if it were in kilometers, the pattern would not make as much logical sense.

ggplot(data=Attritiondata,aes(x=MonthlyIncome, y=DistanceFromHome)) + geom_point(position="jitter") + geom_smooth(method=loess) + facet_wrap(~Gender) + ggtitle("Monthly Income and Distance from Home categorized by Gender")

The downturned ‘w’ shape is more prominent for women than for men, but the overall relationship between distance from home and Monthly Income is as described above.

ggplot(data=Attritiondata, aes(x=Department, y=JobSatisfaction, color = Gender)) + geom_point(position ="jitter") + facet_wrap(~Gender) + ggtitle("Department and Job Satisfaction by Gender")

Job Satisfaction seems to be higher in Research & Development for both males and females. There are fewer data points for Human Resources, so a clearly defined relationship can’t be concluded there. Job satisfaction also seems decent in Sales for both genders.

ggplot(data=Attritiondata,aes(x=Age, y=TotalWorkingYears, color = MonthlyIncome)) + geom_point(position = "jitter") + ggtitle("Age and Total Working Years with Monthly Income")

This plot confirms that as Age increases, Total Working Years also increases, and Monthly Income follows the same trend. There is a positive correlation among all three variables: Age, Total Working Years, and Monthly Income.

Attrition Specific Analysis

I filtered the data to look specifically at those who attrited, to find insights and relationships between the variables.

attrition_yes <- dplyr::filter(Attritiondata, Attrition == "Yes")
ggplot(data = attrition_yes, aes(x = Department, fill = Gender))+ geom_bar(position = "dodge") + 
  ggtitle("Attrition by Department and Gender") + theme_minimal()

Of those who left the company, many of the men were in Research & Development and Sales, while equal numbers of women left from Research & Development and Sales. Equal numbers of men and women left from Human Resources.

ggplot(data = attrition_yes, aes(y = JobSatisfaction , x = DistanceFromHome, color = MonthlyIncome))+ 
geom_point(position = "jitter") + theme_minimal() + geom_smooth(method =lm) + ggtitle("Attrition by Distance from Home, Job Satisfaction with Monthly Income")

Of those who left the company, it seems like many were making under $10,000 monthly. There appears to be a negative relationship between Job Satisfaction and Distance from Home: as distance from home (in miles) increases, job satisfaction goes down.

ggplot(data = attrition_yes, aes(y=JobLevel, x = MonthlyIncome, color = Age))+
  geom_point(position="jitter")+geom_smooth(method=lm) + ggtitle("Monthly Income and Job Level with Age")

Of those who left the company, there is a positive relationship between Job Level and Monthly Income. Age appears scattered, but around Job Level 1 with a monthly income under 5,000, the group skews toward employees in their 20s.

ggplot(data = attrition_yes, aes(y=JobLevel, x = Age, color = Gender))+
  geom_point(position="jitter")+geom_smooth(method=lm) + ggtitle("Attrition Job Level by Age and Gender")

Of those who left the company, Job Level is positively correlated with Age for both males and females. The slope of the regression line for females appears steeper than that for males.

ggplot(data = attrition_yes, aes(x=OverTime, fill = Gender)) + geom_bar() + ggtitle("Attrition -  Overtime by Gender")

Of those who left the company, many employees were working overtime. As mentioned earlier, there are more male data points than female, which is why the chart may look a bit off at first glance. Of the overtime group, approximately 70% were male and 30% were female.

ggplot(data = attrition_yes, aes(x=NumCompaniesWorked, y = PercentSalaryHike)) + geom_point(position = "jitter") + theme_minimal()+geom_smooth(method = lm) + ggtitle ("Attrition - Percent Salary Hike and Number of Companies Worked")

There seems to be a slight downward trend in the relationship between Number of Companies Worked and Percent Salary Hike. I was trying to see whether the number of companies worked affected the percent salary hike positively. At first glance the line looks flat, but on closer inspection there is a slight downward trend.

ggplot(data = attrition_yes, aes(x = PercentSalaryHike, fill = OverTime)) +
  geom_histogram( binwidth = 1, color = "black", alpha = 0.7) +
  geom_density(aes(y = ..count..), fill = "transparent", color = "darkblue") +
  labs(title = "Histogram with Trend Line of % Salary Hike by Overtime",
       x = "Percent Salary Hike", y = "Count") +
  theme_minimal() + theme(legend.position = "top")  # Adjust legend position

This is a histogram of Percent Salary Hike, with each bar split and shaded by whether the employee(s) worked overtime. The overall distribution (without the split) appears right-skewed, but the shape of the trend line suggests it may be multimodal.

ggplot(data = attrition_yes, aes(x=YearsAtCompany, y = PercentSalaryHike, color = PerformanceRating)) + geom_point() + theme_minimal() + ggtitle("Salary Hike v. Years at Company with Performance Rating")

This plot takes a look at Years at Company and Percent salary hike. Those with a higher performance rating seem to have a higher percent salary hike. For a Percent salary hike less than 20%, it seems like the performance rating is under 4, and trends around the 3 rating.

ggplot(data = attrition_yes, aes(y=JobSatisfaction, x = Education)) + geom_point(position="jitter") + theme_minimal() + ggtitle("Job Satisfaction v. Education Categorized by Education Field") + facet_wrap(~EducationField) + geom_smooth(method = lm)

This plot looks at the relationship between Job Satisfaction and Education, categorized by Education Field. Most education fields show a negative correlation between job satisfaction and education level, except for the Medical field: it is the only field where job satisfaction also seems to increase as education increases.

Building a Regression Model to Determine Salary

Regression Model 1 Using All Predictors to Determine Salary

Monthly Income is the “salary” variable

class(Attritiondata$MonthlyIncome)
## [1] "integer"
sum(is.na(Attritiondata$MonthlyIncome))
## [1] 0
summary(Attritiondata)
##        ID             Age         Attrition         BusinessTravel    
##  Min.   :  1.0   Min.   :18.00   Length:870         Length:870        
##  1st Qu.:218.2   1st Qu.:30.00   Class :character   Class :character  
##  Median :435.5   Median :35.00   Mode  :character   Mode  :character  
##  Mean   :435.5   Mean   :36.83                                        
##  3rd Qu.:652.8   3rd Qu.:43.00                                        
##  Max.   :870.0   Max.   :60.00                                        
##    DailyRate       Department        DistanceFromHome   Education    
##  Min.   : 103.0   Length:870         Min.   : 1.000   Min.   :1.000  
##  1st Qu.: 472.5   Class :character   1st Qu.: 2.000   1st Qu.:2.000  
##  Median : 817.5   Mode  :character   Median : 7.000   Median :3.000  
##  Mean   : 815.2                      Mean   : 9.339   Mean   :2.901  
##  3rd Qu.:1165.8                      3rd Qu.:14.000   3rd Qu.:4.000  
##  Max.   :1499.0                      Max.   :29.000   Max.   :5.000  
##  EducationField     EmployeeCount EmployeeNumber   EnvironmentSatisfaction
##  Length:870         Min.   :1     Min.   :   1.0   Min.   :1.000          
##  Class :character   1st Qu.:1     1st Qu.: 477.2   1st Qu.:2.000          
##  Mode  :character   Median :1     Median :1039.0   Median :3.000          
##                     Mean   :1     Mean   :1029.8   Mean   :2.701          
##                     3rd Qu.:1     3rd Qu.:1561.5   3rd Qu.:4.000          
##                     Max.   :1     Max.   :2064.0   Max.   :4.000          
##     Gender            HourlyRate     JobInvolvement     JobLevel    
##  Length:870         Min.   : 30.00   Min.   :1.000   Min.   :1.000  
##  Class :character   1st Qu.: 48.00   1st Qu.:2.000   1st Qu.:1.000  
##  Mode  :character   Median : 66.00   Median :3.000   Median :2.000  
##                     Mean   : 65.61   Mean   :2.723   Mean   :2.039  
##                     3rd Qu.: 83.00   3rd Qu.:3.000   3rd Qu.:3.000  
##                     Max.   :100.00   Max.   :4.000   Max.   :5.000  
##    JobRole          JobSatisfaction MaritalStatus      MonthlyIncome  
##  Length:870         Min.   :1.000   Length:870         Min.   : 1081  
##  Class :character   1st Qu.:2.000   Class :character   1st Qu.: 2840  
##  Mode  :character   Median :3.000   Mode  :character   Median : 4946  
##                     Mean   :2.709                      Mean   : 6390  
##                     3rd Qu.:4.000                      3rd Qu.: 8182  
##                     Max.   :4.000                      Max.   :19999  
##   MonthlyRate    NumCompaniesWorked   OverTime         PercentSalaryHike
##  Min.   : 2094   Min.   :0.000      Length:870         Min.   :11.0     
##  1st Qu.: 8092   1st Qu.:1.000      Class :character   1st Qu.:12.0     
##  Median :14074   Median :2.000      Mode  :character   Median :14.0     
##  Mean   :14326   Mean   :2.728                         Mean   :15.2     
##  3rd Qu.:20456   3rd Qu.:4.000                         3rd Qu.:18.0     
##  Max.   :26997   Max.   :9.000                         Max.   :25.0     
##  PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
##  Min.   :3.000     Min.   :1.000            Min.   :80    Min.   :0.0000  
##  1st Qu.:3.000     1st Qu.:2.000            1st Qu.:80    1st Qu.:0.0000  
##  Median :3.000     Median :3.000            Median :80    Median :1.0000  
##  Mean   :3.152     Mean   :2.707            Mean   :80    Mean   :0.7839  
##  3rd Qu.:3.000     3rd Qu.:4.000            3rd Qu.:80    3rd Qu.:1.0000  
##  Max.   :4.000     Max.   :4.000            Max.   :80    Max.   :3.0000  
##  TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany  
##  Min.   : 0.00     Min.   :0.000         Min.   :1.000   Min.   : 0.000  
##  1st Qu.: 6.00     1st Qu.:2.000         1st Qu.:2.000   1st Qu.: 3.000  
##  Median :10.00     Median :3.000         Median :3.000   Median : 5.000  
##  Mean   :11.05     Mean   :2.832         Mean   :2.782   Mean   : 6.962  
##  3rd Qu.:15.00     3rd Qu.:3.000         3rd Qu.:3.000   3rd Qu.:10.000  
##  Max.   :40.00     Max.   :6.000         Max.   :4.000   Max.   :40.000  
##  YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
##  Min.   : 0.000     Min.   : 0.000          Min.   : 0.00       
##  1st Qu.: 2.000     1st Qu.: 0.000          1st Qu.: 2.00       
##  Median : 3.000     Median : 1.000          Median : 3.00       
##  Mean   : 4.205     Mean   : 2.169          Mean   : 4.14       
##  3rd Qu.: 7.000     3rd Qu.: 3.000          3rd Qu.: 7.00       
##  Max.   :18.000     Max.   :15.000          Max.   :17.00
# Identify character variables
char_vars <- sapply(Attritiondata, is.character)
# Convert character variables to factors
Attritiondata[, char_vars] <- lapply(Attritiondata[, char_vars], as.factor)

#Check Factor Levels for Categorical variables:
sapply(Attritiondata[,char_vars],levels)
## $Attrition
## [1] "No"  "Yes"
## 
## $BusinessTravel
## [1] "Non-Travel"        "Travel_Frequently" "Travel_Rarely"    
## 
## $Department
## [1] "Human Resources"        "Research & Development" "Sales"                 
## 
## $EducationField
## [1] "Human Resources"  "Life Sciences"    "Marketing"        "Medical"         
## [5] "Other"            "Technical Degree"
## 
## $Gender
## [1] "Female" "Male"  
## 
## $JobRole
## [1] "Healthcare Representative" "Human Resources"          
## [3] "Laboratory Technician"     "Manager"                  
## [5] "Manufacturing Director"    "Research Director"        
## [7] "Research Scientist"        "Sales Executive"          
## [9] "Sales Representative"     
## 
## $MaritalStatus
## [1] "Divorced" "Married"  "Single"  
## 
## $OverTime
## [1] "No"  "Yes"
  #Noticed Over18 has only 1 factor level; so going to remove from dataset
 # Attritiondata <- subset(Attritiondata, select = -c(Over18))

# Fit the linear regression model with all predictors
Model1_fit <- lm(MonthlyIncome ~ ., data = Attritiondata)
summary(Model1_fit)
## 
## Call:
## lm(formula = MonthlyIncome ~ ., data = Attritiondata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3708.8  -674.1    14.7   614.1  4100.0 
## 
## Coefficients: (2 not defined because of singularities)
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       6.013e+01  7.772e+02   0.077 0.938349    
## ID                               -2.343e-01  1.476e-01  -1.588 0.112713    
## Age                              -1.431e+00  5.649e+00  -0.253 0.800110    
## AttritionYes                      8.904e+01  1.154e+02   0.771 0.440729    
## BusinessTravelTravel_Frequently   1.895e+02  1.420e+02   1.334 0.182441    
## BusinessTravelTravel_Rarely       3.720e+02  1.200e+02   3.099 0.002005 ** 
## DailyRate                         1.452e-01  9.129e-02   1.591 0.112062    
## DepartmentResearch & Development  1.234e+02  4.768e+02   0.259 0.795866    
## DepartmentSales                  -4.594e+02  4.877e+02  -0.942 0.346580    
## DistanceFromHome                 -6.237e+00  4.578e+00  -1.362 0.173417    
## Education                        -3.743e+01  3.716e+01  -1.007 0.314105    
## EducationFieldLife Sciences       1.352e+02  3.692e+02   0.366 0.714248    
## EducationFieldMarketing           1.377e+02  3.914e+02   0.352 0.725050    
## EducationFieldMedical             3.326e+01  3.699e+02   0.090 0.928376    
## EducationFieldOther               9.152e+01  3.946e+02   0.232 0.816664    
## EducationFieldTechnical Degree    9.680e+01  3.843e+02   0.252 0.801179    
## EmployeeCount                            NA         NA      NA       NA    
## EmployeeNumber                    8.681e-02  6.103e-02   1.422 0.155269    
## EnvironmentSatisfaction          -6.267e+00  3.364e+01  -0.186 0.852252    
## GenderMale                        1.100e+02  7.442e+01   1.478 0.139715    
## HourlyRate                       -3.591e-01  1.824e+00  -0.197 0.844003    
## JobInvolvement                    1.677e+01  5.321e+01   0.315 0.752698    
## JobLevel                          2.783e+03  8.340e+01  33.375  < 2e-16 ***
## JobRoleHuman Resources           -2.053e+02  5.157e+02  -0.398 0.690663    
## JobRoleLaboratory Technician     -5.891e+02  1.714e+02  -3.437 0.000618 ***
## JobRoleManager                    4.280e+03  2.830e+02  15.122  < 2e-16 ***
## JobRoleManufacturing Director     1.809e+02  1.696e+02   1.067 0.286497    
## JobRoleResearch Director          4.077e+03  2.193e+02  18.592  < 2e-16 ***
## JobRoleResearch Scientist        -3.494e+02  1.705e+02  -2.049 0.040790 *  
## JobRoleSales Executive            5.263e+02  3.576e+02   1.472 0.141449    
## JobRoleSales Representative       8.531e+01  3.918e+02   0.218 0.827703    
## JobSatisfaction                   3.278e+01  3.344e+01   0.980 0.327278    
## MaritalStatusMarried              6.708e+01  1.002e+02   0.669 0.503497    
## MaritalStatusSingle               1.128e+01  1.361e+02   0.083 0.933978    
## MonthlyRate                      -9.505e-03  5.143e-03  -1.848 0.064946 .  
## NumCompaniesWorked                5.421e+00  1.691e+01   0.321 0.748622    
## OverTimeYes                      -1.394e+01  8.434e+01  -0.165 0.868787    
## PercentSalaryHike                 2.586e+01  1.581e+01   1.635 0.102351    
## PerformanceRating                -3.235e+02  1.614e+02  -2.004 0.045368 *  
## RelationshipSatisfaction          1.640e+01  3.339e+01   0.491 0.623375    
## StandardHours                            NA         NA      NA       NA    
## StockOptionLevel                 -2.758e+00  5.740e+01  -0.048 0.961695    
## TotalWorkingYears                 5.080e+01  1.098e+01   4.627  4.3e-06 ***
## TrainingTimesLastYear             2.436e+01  2.912e+01   0.837 0.403111    
## WorkLifeBalance                  -3.472e+01  5.161e+01  -0.673 0.501284    
## YearsAtCompany                   -2.750e+00  1.370e+01  -0.201 0.840925    
## YearsInCurrentRole                3.398e+00  1.711e+01   0.199 0.842584    
## YearsSinceLastPromotion           3.084e+01  1.532e+01   2.013 0.044405 *  
## YearsWithCurrManager             -2.691e+01  1.669e+01  -1.613 0.107210    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1055 on 823 degrees of freedom
## Multiple R-squared:  0.9501, Adjusted R-squared:  0.9473 
## F-statistic: 340.7 on 46 and 823 DF,  p-value: < 2.2e-16
#Predictors with p-value < alpha of .05 significantly affect the salary variable

#Model1_Preds = predict(Model1_fit, newdata = NoSalary) #this is an example of predict function you would want to use
#as.data.frame(Model1_Preds)
#write.csv(Model1_Preds,"Model1PredictionsNoSalaryRenuKarthikeyan.csv")

Looking at the summary of Model 1, the statistically significant predictors are Business Travel, Job Level, Job Role, Performance Rating, Total Working Years, and Years Since Last Promotion. The F-statistic tests the overall significance of the model; an F-statistic of 340.7 with a very small p-value (< 2.2e-16) suggests that at least one predictor is significantly related to Monthly Income. Two coefficients (EmployeeCount and StandardHours) are not defined because of singularities: both columns are constant across all 870 rows, so they carry no information. Singularities like these can also arise from multicollinearity, where two or more predictors are perfectly correlated.
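One way to confirm which terms caused the singularities is with base R's `alias()` and caret's `nearZeroVar()`. This is a sketch, assuming `Attritiondata` and `Model1_fit` are still in scope as above:

```r
# alias() (base R stats) reports coefficients that are exact linear
# combinations of others -- these are the terms lm() set to NA.
alias(Model1_fit)$Complete

# nearZeroVar() from caret flags zero-variance (constant) columns directly;
# EmployeeCount and StandardHours should appear here.
nzv <- nearZeroVar(Attritiondata, saveMetrics = TRUE)
rownames(nzv[nzv$zeroVar, ])
```

Dropping the flagged columns before refitting removes the NA coefficients without changing the rest of the fit.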

Linear Regression using Select predictors to Determine Salary

Model2_fit = lm(MonthlyIncome ~ NumCompaniesWorked + Age + Gender + MaritalStatus + JobInvolvement + JobRole + DistanceFromHome + JobLevel + Education, data = Attritiondata)
summary(Model2_fit) # P value overall implies that at least one of my variables' slope != 0. 
## 
## Call:
## lm(formula = MonthlyIncome ~ NumCompaniesWorked + Age + Gender + 
##     MaritalStatus + JobInvolvement + JobRole + DistanceFromHome + 
##     JobLevel + Education, data = Attritiondata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3536.4  -699.4   -39.1   659.8  4178.5 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -600.926    320.383  -1.876  0.06105 .  
## NumCompaniesWorked              17.866     15.366   1.163  0.24526    
## Age                             12.875      4.962   2.595  0.00963 ** 
## GenderMale                     120.277     75.308   1.597  0.11061    
## MaritalStatusMarried           106.159     95.406   1.113  0.26615    
## MaritalStatusSingle             23.451    103.944   0.226  0.82155    
## JobInvolvement                  15.469     52.613   0.294  0.76882    
## JobRoleHuman Resources        -327.849    255.950  -1.281  0.20057    
## JobRoleLaboratory Technician  -534.026    172.285  -3.100  0.00200 ** 
## JobRoleManager                3933.510    232.847  16.893  < 2e-16 ***
## JobRoleManufacturing Director   92.825    169.877   0.546  0.58492    
## JobRoleResearch Director      3919.755    218.518  17.938  < 2e-16 ***
## JobRoleResearch Scientist     -246.097    171.976  -1.431  0.15280    
## JobRoleSales Executive        -122.894    146.575  -0.838  0.40202    
## JobRoleSales Representative   -392.881    217.075  -1.810  0.07067 .  
## DistanceFromHome                -7.908      4.553  -1.737  0.08281 .  
## JobLevel                      3042.789     69.854  43.559  < 2e-16 ***
## Education                      -33.713     37.379  -0.902  0.36736    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1078 on 852 degrees of freedom
## Multiple R-squared:  0.9461, Adjusted R-squared:  0.945 
## F-statistic: 878.9 on 17 and 852 DF,  p-value: < 2.2e-16
NoSalary$MonthlyIncome = predict(Model2_fit,newdata = NoSalary)

#Model2_Preds<- NoSalary %>% select(c("ID","MonthlyIncome"))
#as.data.frame(Model2_Preds)
#write.csv(Model2_Preds,"Model2PredictionsNoSalaryRenuKarthikeyan.csv", row.names = T)

These predictors were chosen based on insights from the EDA. The coefficient for JobLevel is 3042.789, suggesting that, on average, a one-unit increase in JobLevel is associated with an increase of about 3,043 dollars in MonthlyIncome, holding all other variables constant. Likewise, Age has a coefficient of 12.875, indicating that each additional year of age is associated with about 12.88 dollars of additional monthly income, holding all other variables constant. GenderMale has a coefficient of about 120, indicating that, all else equal, males earn roughly 120 dollars more per month than females. The coefficient for DistanceFromHome is -7.908, indicating that each additional mile from home is associated with a monthly income decrease of about 7.91 dollars.

The statistically significant p-values (< .10) are Age, certain job roles (Laboratory Technician, Manager, Research Director, and Sales Representative), Job Level, and Distance From Home.

Overall, Model 2 has a high R-squared value, indicating a good fit to the data. Several predictors are statistically significant, suggesting they contribute to determining Monthly Income.
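Since Model 2 keeps several related predictors, a quick multicollinearity check with `car::vif()` (the car package is already loaded) can verify that none of them are redundant. This is a sketch, assuming `Model2_fit` is in scope:

```r
# For models with factors, vif() reports generalized VIFs (GVIF).
# A common rule of thumb flags GVIF^(1/(2*Df)) above ~2 (roughly VIF > 5)
# as a potential multicollinearity concern.
vif(Model2_fit)
```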

Split and Train Linear Regression Models to Predict Salary, with average of cross validation

set.seed(1234)
TrainObs = sample(seq(1,dim(Attritiondata)[1]),round(.8*dim(Attritiondata)[1]),replace = FALSE)
SalaryTrain = Attritiondata[TrainObs,]
head(SalaryTrain,5)
##      ID Age Attrition BusinessTravel DailyRate             Department
## 284 284  31        No  Travel_Rarely       691                  Sales
## 848 848  39        No  Travel_Rarely      1132 Research & Development
## 101 101  27        No  Travel_Rarely      1377 Research & Development
## 623 623  34        No  Travel_Rarely       182 Research & Development
## 645 645  41       Yes     Non-Travel       906 Research & Development
##     DistanceFromHome Education EducationField EmployeeCount EmployeeNumber
## 284                7         3      Marketing             1            438
## 848                1         3        Medical             1            417
## 101               11         1  Life Sciences             1           1434
## 623                1         4  Life Sciences             1            797
## 645                5         2  Life Sciences             1           1210
##     EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 284                       4   Male         73              3        2
## 848                       3   Male         48              4        3
## 101                       2   Male         91              3        1
## 623                       2 Female         72              4        1
## 645                       1   Male         95              2        1
##                       JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 284           Sales Executive               4      Divorced          7547
## 848 Healthcare Representative               4      Divorced          9613
## 101     Laboratory Technician               1       Married          2099
## 623        Research Scientist               4        Single          3280
## 645        Research Scientist               1      Divorced          2107
##     MonthlyRate NumCompaniesWorked OverTime PercentSalaryHike PerformanceRating
## 284        7143                  4       No                12                 3
## 848       10942                  0       No                17                 3
## 101        7679                  0       No                14                 3
## 623       13551                  2       No                16                 3
## 645       20293                  6       No                17                 3
##     RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
## 284                        4            80                3                13
## 848                        1            80                3                19
## 101                        2            80                0                 6
## 623                        3            80                0                10
## 645                        1            80                1                 5
##     TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 284                     3               3              7                  7
## 848                     5               2             18                 10
## 101                     3               4              5                  0
## 623                     2               3              4                  2
## 645                     2               1              1                  0
##     YearsSinceLastPromotion YearsWithCurrManager
## 284                       1                    7
## 848                       3                    7
## 101                       1                    4
## 623                       1                    3
## 645                       0                    0
SalaryTest = Attritiondata[-TrainObs,]
head(SalaryTest,5)
##    ID Age Attrition    BusinessTravel DailyRate             Department
## 5   5  24        No Travel_Frequently       567 Research & Development
## 8   8  37        No     Travel_Rarely       309                  Sales
## 9   9  34        No     Travel_Rarely      1333                  Sales
## 16 16  31        No        Non-Travel      1188                  Sales
## 18 18  46        No        Non-Travel      1144 Research & Development
##    DistanceFromHome Education   EducationField EmployeeCount EmployeeNumber
## 5                 2         1 Technical Degree             1           1646
## 8                10         4    Life Sciences             1           1105
## 9                10         4    Life Sciences             1           1055
## 16               20         2        Marketing             1            947
## 18                7         4          Medical             1            487
##    EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 5                        1 Female         32              3        1
## 8                        4 Female         88              2        2
## 9                        3 Female         87              3        1
## 16                       4 Female         45              3        2
## 18                       3 Female         30              3        2
##                   JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 5      Research Scientist               4        Single          3760
## 8         Sales Executive               4      Divorced          6694
## 9    Sales Representative               3       Married          2220
## 16        Sales Executive               3       Married          6932
## 18 Manufacturing Director               3       Married          5258
##    MonthlyRate NumCompaniesWorked OverTime PercentSalaryHike PerformanceRating
## 5        17218                  1      Yes                13                 3
## 8        24223                  2      Yes                14                 3
## 9        18410                  1      Yes                19                 3
## 16       24406                  1       No                13                 3
## 18       16044                  2       No                14                 3
##    RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
## 5                         3            80                0                 6
## 8                         3            80                3                 8
## 9                         4            80                1                 1
## 16                        4            80                1                 9
## 18                        3            80                0                 7
##    TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 5                      2               3              6                  3
## 8                      5               3              1                  0
## 9                      2               3              1                  1
## 16                     2               2              9                  8
## 18                     2               4              1                  0
##    YearsSinceLastPromotion YearsWithCurrManager
## 5                        1                    3
## 8                        0                    0
## 9                        0                    0
## 16                       0                    0
## 18                       0                    0
    
#Cross Validation and Mean Square Predictor Error Calculation
    numMSPEs = 1000
    MSPEHolderModel1 = numeric(numMSPEs)
    MSPEHolderModel2 = numeric(numMSPEs)
    RMSEHolderModel1 = numeric(numMSPEs)
    RMSEHolderModel2 = numeric(numMSPEs)
    
    for (i in 1:numMSPEs)
    {
      TrainObs = sample(seq(1,dim(Attritiondata)[1]),round(.8*dim(Attritiondata)[1]),replace = FALSE)
      SalaryTrain = Attritiondata[TrainObs,]
      head(SalaryTrain,5)
      SalaryTest = Attritiondata[-TrainObs,]
      head(SalaryTest,5)
      Model1_fit <- lm(MonthlyIncome ~ ., data = SalaryTrain)
      Model1_Preds = predict(Model1_fit, newdata = SalaryTest)
      
      #MSPE Model 1
      MSPE = mean((SalaryTest$MonthlyIncome - Model1_Preds)^2)
      MSPE
      MSPEHolderModel1[i] = MSPE
      RMSEHolderModel1[i] = sqrt(MSPE)
      
      
      #Model 2
      Model2_fit = lm(MonthlyIncome ~ NumCompaniesWorked + Age + Gender + MaritalStatus + JobInvolvement + JobRole + DistanceFromHome + JobLevel + Education, data = SalaryTrain)
      Model2_Preds = predict(Model2_fit,newdata = SalaryTest)
      MSPE = mean((SalaryTest$MonthlyIncome - Model2_Preds)^2)
      MSPE
      MSPEHolderModel2[i] = MSPE
      RMSEHolderModel2[i] = sqrt(MSPE)
    }
    
    mean(MSPEHolderModel1)
## [1] 1196649
    mean(MSPEHolderModel2)
## [1] 1194877
    mean(RMSEHolderModel1)
## [1] 1091.936
    mean(RMSEHolderModel2)
## [1] 1091.156
    AIC(Model1_fit)
## [1] 11699.08
    AIC(Model2_fit)
## [1] 11698.73
  • Data was split 80/20 for training and testing
  • Both Model 1 and Model 2 were fit on the training set and used to predict Monthly Income on the testing set
  • Cross-validation process with 1000 iterations, where MSPE and RMSE are calculated for both models
  • Summary statistics: mean MSPE and mean RMSE for both models

Conclusion: Model 2 is the better fit as it has a lower mean RMSE and lower mean MSPE
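The manual cross-validation loop above can also be reproduced more concisely with caret's built-in k-fold CV, which averages RMSE across folds automatically. A minimal sketch, assuming `Attritiondata` is in scope and using 10 folds rather than the 1000 random splits above:

```r
# caret handles the splitting, refitting, and error averaging internally.
ctrl <- trainControl(method = "cv", number = 10)
cv_model2 <- train(MonthlyIncome ~ NumCompaniesWorked + Age + Gender +
                     MaritalStatus + JobInvolvement + JobRole +
                     DistanceFromHome + JobLevel + Education,
                   data = Attritiondata, method = "lm", trControl = ctrl)
cv_model2$results$RMSE  # comparable to the mean RMSE from the manual loop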

KNN and Naive Bayes

When I initially attempted to use all predictors for KNN, the model repeatedly failed with "Warning: NAs introduced by coercion" followed by "Error in knn(train_predictors, test_predictors, response_train, prob = TRUE, ...): NA/NaN/Inf in foreign function call (arg 6)", even though there are no missing values in the Attritiondata data set. I later realized this was because the categorical variables were being coerced to NA when converted to numeric in the KNN chunk. Creating dummy columns for these categorical variables allowed KNN to run without errors. Below is the code used to create the dummy columns with the fastDummies package in R.
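As an alternative to the dummy-column approach below, base R's `model.matrix()` can one-hot encode every factor column in a single call. This is a sketch, assuming `Attritiondata` still holds its factor columns at this point:

```r
# model.matrix() expands factors into indicator columns. The "- 1" removes
# the intercept, so the first factor keeps all of its levels; subsequent
# factors still drop one reference level each.
X <- model.matrix(~ . - 1, data = subset(Attritiondata, select = -Attrition))
dim(X)  # an all-numeric predictor matrix, safe to pass to class::knn()
```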

Preparation of the Data - Including Dummy Columns for Categorical Variables

Attritiondata$BusinessTravel<- as.factor(Attritiondata$BusinessTravel)
Attritiondata$Department<- as.factor(Attritiondata$Department)
Attritiondata$EducationField<- as.factor(Attritiondata$EducationField)
Attritiondata$Gender<- as.factor(Attritiondata$Gender)
Attritiondata$JobRole<- as.factor(Attritiondata$JobRole)
Attritiondata$MaritalStatus<- as.factor(Attritiondata$MaritalStatus)
Attritiondata$OverTime<- as.factor(Attritiondata$OverTime)


Attritiondata<-dummy_cols(Attritiondata,select_columns=c("BusinessTravel","MaritalStatus","JobRole","Department","EducationField","OverTime","Gender"))

Attritiondata <- Attritiondata %>% select(-c("BusinessTravel","MaritalStatus","JobRole","Department","EducationField","OverTime","Gender"))

#Doing Same for Attrition Test Data Set (AttritionTest)
AttritionTest$BusinessTravel<- as.factor(AttritionTest$BusinessTravel)
AttritionTest$Department<- as.factor(AttritionTest$Department)
AttritionTest$EducationField<- as.factor(AttritionTest$EducationField)
AttritionTest$Gender<- as.factor(AttritionTest$Gender)
AttritionTest$JobRole<- as.factor(AttritionTest$JobRole)
AttritionTest$MaritalStatus<- as.factor(AttritionTest$MaritalStatus)
AttritionTest$OverTime<- as.factor(AttritionTest$OverTime)


AttritionTest<-dummy_cols(AttritionTest,select_columns=c("BusinessTravel","MaritalStatus","JobRole","Department","EducationField","OverTime","Gender"))

AttritionTest <- AttritionTest %>% select(-c("BusinessTravel","MaritalStatus","JobRole","Department","EducationField","OverTime","Gender"))


diff_columns_df1 <- setdiff(names(Attritiondata), names(AttritionTest))
cat("Columns in Attritiondata but not in AttritionTest:", paste(diff_columns_df1, collapse = ", "), "\n")
## Columns in Attritiondata but not in AttritionTest: Attrition
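Here only the Attrition response differs, which is expected. If the test set had been missing dummy columns for factor levels absent from its data, those columns could be added as zeros so the two column sets match. A hedged sketch of that alignment step (not needed for this data, shown for completeness):

```r
# Add any train-only dummy columns to the test set as zeros, then reorder
# the test columns to match the training predictors.
missing_cols <- setdiff(names(Attritiondata), c(names(AttritionTest), "Attrition"))
AttritionTest[missing_cols] <- 0
AttritionTest <- AttritionTest[, setdiff(names(Attritiondata), "Attrition")]
```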

KNN Model and Confusion Matrix - Training with All predictors

set.seed(1234)
iterations <- 100
numks <- 10
splitPerc <- 0.8

masterAcc <- matrix(nrow = iterations, ncol = numks)


for (j in 1:iterations) {
  trainIndices <- sample(1:dim(Attritiondata)[1], round(splitPerc * dim(Attritiondata)[1]))
  train <- as.data.frame(Attritiondata[trainIndices, ])
  test <- as.data.frame(Attritiondata[-trainIndices, ])
  
  response_variable <- "Attrition"
  response_train <- factor(train[[response_variable]])
  response_test <- factor(test[[response_variable]])
 
  
  # Select columns for predictors
  selected_columns <- c(1, 2, 4:36)  # All columns except Attrition; note this keeps ID and EmployeeNumber, which are identifiers rather than true predictors
  
  #selected_columns <- c("ID","Age","BusinessTravel","DailyRate","Department", "DistanceFromHome", "Education", "EducationField", "EmployeeCount", "EmployeeNumber","EnvironmentSatisfaction", "Gender", "HourlyRate", "JobInvolvement", "JobLevel", "JobRole", "JobSatisfaction", "MaritalStatus", "MonthlyIncome", "MonthlyRate", "NumCompaniesWorked", "OverTime", "PercentSalaryHike", "PerformanceRating", "RelationshipSatisfaction", "StandardHours", "StockOptionLevel", "TotalWorkingYears", "TrainingTimesLastYear", "WorkLifeBalance", "YearsAtCompany", "YearsInCurrentRole", "YearsSinceLastPromotion", "YearsWithCurrManager")
  
  # Extract the selected columns
  train_predictors <- train[, selected_columns, drop = FALSE]
  test_predictors <- test[, selected_columns, drop = FALSE]

  train_predictors <- apply(train_predictors, 2, as.numeric)
  test_predictors <- apply(test_predictors, 2, as.numeric)
  
  train_predictors <- scale(train_predictors)
  test_predictors <- scale(test_predictors)
  
  # Convert to numeric matrices
  train_predictors <- as.matrix(train_predictors)
  test_predictors <- as.matrix(test_predictors)

  # Remove infinite values
  train_predictors[!is.finite(train_predictors)] <- 0
  test_predictors[!is.finite(test_predictors)] <- 0
  
  
  if (sum(response_train == "Yes") > 0 && sum(response_test == "Yes") > 0) {
    for (i in 1:numks) {
      classifications <- knn(train_predictors, test_predictors, response_train, prob = TRUE, k = i)
      CM_AllK <- confusionMatrix(table(classifications, response_test), positive = "Yes")
      masterAcc[j, i] <- CM_AllK$overall[1]
    }
  }
}
CM_AllK
## Confusion Matrix and Statistics
## 
##                response_test
## classifications  No Yes
##             No  143  26
##             Yes   2   3
##                                           
##                Accuracy : 0.8391          
##                  95% CI : (0.7759, 0.8903)
##     No Information Rate : 0.8333          
##     P-Value [Acc > NIR] : 0.4685          
##                                           
##                   Kappa : 0.134           
##                                           
##  Mcnemar's Test P-Value : 1.383e-05       
##                                           
##             Sensitivity : 0.10345         
##             Specificity : 0.98621         
##          Pos Pred Value : 0.60000         
##          Neg Pred Value : 0.84615         
##              Prevalence : 0.16667         
##          Detection Rate : 0.01724         
##    Detection Prevalence : 0.02874         
##       Balanced Accuracy : 0.54483         
##                                           
##        'Positive' Class : Yes             
## 
MeanAcc = colMeans(masterAcc); MeanAcc
##  [1] 0.7946552 0.7893103 0.8301724 0.8294828 0.8386782 0.8394828 0.8433333
##  [8] 0.8406897 0.8425287 0.8426437
plot(seq(1, numks, 1), MeanAcc, type = "l", ylab = "Mean Accuracy (Positive Class: Yes)")

which.max(MeanAcc)
## [1] 7
max(MeanAcc)
## [1] 0.8433333

From the plot, we see that the best k is k = 7. The overall accuracy is 83.91%, but sensitivity (the true positive rate) is low at 10.35%. The model is good at predicting the majority class ("No") but struggles with the minority class ("Yes").

Looking specifically at the Confusion Matrix statistics, here is the interpretation of each:

- Sensitivity (True Positive Rate): 0.10345, the proportion of employees who actually attrited that the model correctly identified.
- Specificity (True Negative Rate): 0.98621, the proportion of employees who did not leave the company that the model correctly identified.
- Positive Predictive Value (Precision): 0.60000, the proportion of predicted positives that are true positives (predicted attriters who actually attrited).
- Negative Predictive Value: 0.84615, the proportion of predicted negatives that are true negatives (predicted stayers who actually stayed).
- Prevalence: 0.16667, the proportion of actual positives (attriters) in the dataset.
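These definitions can be verified directly from the four cell counts in the matrix above. A quick base-R check, using the counts reported for this run:

```r
# Cell counts from the confusion matrix above (positive class = "Yes")
TP <- 3    # attrited, predicted attrited
FN <- 26   # attrited, predicted stayed
TN <- 143  # stayed, predicted stayed
FP <- 2    # stayed, predicted attrited

sensitivity <- TP / (TP + FN)                    # 0.10345
specificity <- TN / (TN + FP)                    # 0.98621
ppv         <- TP / (TP + FP)                    # 0.60000
npv         <- TN / (TN + FN)                    # 0.84615
prevalence  <- (TP + FN) / (TP + FN + TN + FP)   # 0.16667
round(c(sensitivity, specificity, ppv, npv, prevalence), 5)
```

These hand-computed values match caret's output to five decimal places.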

TRYING THRESHOLD CHANGE for KNN - All predictors; k = 7

set.seed(1234)
iterations<- 100
accuracy_table <- numeric(iterations)
trainIndices <- sample(1:dim(Attritiondata)[1], round(splitPerc * dim(Attritiondata)[1]))

  train <- as.data.frame(Attritiondata[trainIndices, ])
  test <- as.data.frame(Attritiondata[-trainIndices, ])

  train_features<- train[, -which(names(train) == "Attrition")]
  test_features<- test[, -which(names(test) == "Attrition")]
  
  train_target<- train$Attrition
  response_test <- factor(test$Attrition)
  #train_scaled<- scale(train_features)  # scaling here makes knn() error with "no missing values allowed",
  #test_scaled<- scale(test_features)    # likely because constant columns (e.g. EmployeeCount, StandardHours) scale to NaN

for (i in 1:iterations) {
  classifications <- knn(train_features, test_features, train_target, prob = TRUE, k = 7)
  CM_AllK7 <- confusionMatrix(table(as.factor(test$Attrition),classifications), positive ="Yes")
  accuracy_table[i] <- CM_AllK7$overall[1]
}

#print(accuracy_table)
avg_accuracy<-mean(accuracy_table)
avg_accuracy
## [1] 0.7988506
specificity_table<- numeric(iterations)
sensitivity_table<- numeric(iterations)
for (i in 1:iterations) {
  classifications <- knn(train_features, test_features, train_target, prob = TRUE, k = 7)
  CM_AllK7 <- confusionMatrix(table(as.factor(test$Attrition),classifications), positive ="Yes")
  specificity_table[i] <- CM_AllK7$byClass['Specificity']
  sensitivity_table[i] <- CM_AllK7$byClass['Sensitivity']
}
CM_AllK7
## Confusion Matrix and Statistics
## 
##      classifications
##        No Yes
##   No  137   3
##   Yes  32   2
##                                           
##                Accuracy : 0.7989          
##                  95% CI : (0.7315, 0.8557)
##     No Information Rate : 0.9713          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0552          
##                                           
##  Mcnemar's Test P-Value : 2.214e-06       
##                                           
##             Sensitivity : 0.40000         
##             Specificity : 0.81065         
##          Pos Pred Value : 0.05882         
##          Neg Pred Value : 0.97857         
##              Prevalence : 0.02874         
##          Detection Rate : 0.01149         
##    Detection Prevalence : 0.19540         
##       Balanced Accuracy : 0.60533         
##                                           
##        'Positive' Class : Yes             
## 
avg_specificity<-mean(specificity_table)
avg_specificity
## [1] 0.8106509
avg_sensitivity<-mean(sensitivity_table)
avg_sensitivity
## [1] 0.4
###### New Threshold, using the k = 7 classifications from above

#classifications
#attributes(classifications) # Look at possible attributes
#attributes(classifications)$prob # Probability of what was classified for that observation

probs = ifelse(classifications == "Yes",attributes(classifications)$prob, 1- attributes(classifications)$prob)

summary(Attritiondata$Attrition)
##  No Yes 
## 730 140
140/(730+140) #16.09%
## [1] 0.1609195
NewClass = ifelse(probs > .1609, "Yes", "No")
NewClass <- factor(NewClass, levels = levels(response_test))
table(NewClass,response_test)
##         response_test
## NewClass  No Yes
##      No  115  18
##      Yes  30  11
CM_NewThreshold <- confusionMatrix(table(NewClass, response_test), positive = "Yes", mode = "everything")
CM_NewThreshold
## Confusion Matrix and Statistics
## 
##         response_test
## NewClass  No Yes
##      No  115  18
##      Yes  30  11
##                                           
##                Accuracy : 0.7241          
##                  95% CI : (0.6514, 0.7891)
##     No Information Rate : 0.8333          
##     P-Value [Acc > NIR] : 0.9999          
##                                           
##                   Kappa : 0.1479          
##                                           
##  Mcnemar's Test P-Value : 0.1124          
##                                           
##             Sensitivity : 0.37931         
##             Specificity : 0.79310         
##          Pos Pred Value : 0.26829         
##          Neg Pred Value : 0.86466         
##               Precision : 0.26829         
##                  Recall : 0.37931         
##                      F1 : 0.31429         
##              Prevalence : 0.16667         
##          Detection Rate : 0.06322         
##    Detection Prevalence : 0.23563         
##       Balanced Accuracy : 0.58621         
##                                           
##        'Positive' Class : Yes             
## 

As before, the loop ran for 100 iterations, with k fixed at 7 for the knn call. There are two confusion matrices: one without the threshold change and one with it. With the threshold change, accuracy decreased (79.89% to 72.41%) and sensitivity and specificity each dipped slightly, while the positive predictive value rose substantially (5.88% to 26.83%) and the negative predictive value fell (97.86% to 86.47%).

Relative to the original k = 7 classification, the new threshold has lower accuracy, and sensitivity and specificity are each slightly lower, so somewhat fewer true positives and true negatives are captured. Precision, however, is markedly higher (26.83% vs. 5.88%), so positive predictions become more trustworthy even though fewer cases are classified correctly overall. The original KNN classification retains the higher accuracy and sensitivity.
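Rather than trying a single cutoff, the same idea can be swept over a grid of candidate thresholds to make the sensitivity/specificity trade-off explicit. The sketch below uses simulated probabilities and labels as stand-ins for `attributes(classifications)$prob` and the test labels; all values here are illustrative, not taken from the model above:

```r
# Sketch: sweep candidate thresholds over the knn "Yes" probabilities.
# `truth` and `probs` are simulated stand-ins for the test labels and
# attributes(classifications)$prob.
set.seed(42)
truth <- factor(sample(c("No", "Yes"), 174, replace = TRUE, prob = c(.84, .16)),
                levels = c("No", "Yes"))
probs <- ifelse(truth == "Yes", runif(174, 0.05, 0.60), runif(174, 0.00, 0.40))

cuts <- c(0.10, 0.1609, 0.25, 0.50)   # 0.1609 = the observed attrition prevalence
sweep_tbl <- data.frame(
  cut  = cuts,
  sens = sapply(cuts, function(cut) mean(probs[truth == "Yes"] > cut)),
  spec = sapply(cuts, function(cut) mean(probs[truth == "No"] <= cut))
)
sweep_tbl  # sensitivity falls and specificity rises as the threshold increases
```

Plotting `sens` against `1 - spec` over a fine grid of cutoffs is effectively an ROC curve, which would give a principled way to pick the threshold instead of defaulting to the prevalence.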

#Applying the Best Model (KNN with All Predictors) to the Test Set to Predict Attrition

train_features<- Attritiondata[, -which(names(Attritiondata) == "Attrition")]
test_features<- AttritionTest  # the competition test set has no Attrition column, so nothing needs dropping
#head(test_features,5)
train_target <- Attritiondata$Attrition

AttritionClassifications <- knn(train_features, test_features, train_target, prob = TRUE, k = 7)
AttritionTest$Attrition <- AttritionClassifications
#head(AttritionTest,5)

AttritionPredictionsKNN<-AttritionTest%>%select(c("ID","Attrition"))
#head(AttritionPredictionsKNN,5)
write.csv(AttritionPredictionsKNN,"CaseStudy2AttritionPredictionsKNN_RenuKarthikeyan.csv", row.names = TRUE)

KNN Using Select Predictors

After doing my Exploratory Data Analysis, I believe the important predictors for Attrition are: Gender, Department, Job Satisfaction, Distance From Home, Monthly Income, Job Level, Age, Over Time, Percent Salary Hike, Performance Rating, and Education.

set.seed(1234)
iterations = 100
numks = 10
splitPerc = .8

masterAcc = matrix(nrow = iterations, ncol = numks)

for (j in 1:iterations) {
  trainIndices <- sample(1:dim(Attritiondata)[1], round(splitPerc * dim(Attritiondata)[1]))
  train <- as.data.frame(Attritiondata[trainIndices, ])
  test <- as.data.frame(Attritiondata[-trainIndices, ])
  
  response_train <- factor(train$Attrition)
  response_test <- factor(test$Attrition)

  # Column indices for the selected predictors listed above
  selected_cols <- c(2, 6, 7, 8, 13, 16, 18, 20, 24, 25, 26)

  train_predictors <- train[, selected_cols]
  test_predictors <- test[, selected_cols]

  train_predictors <- apply(train_predictors, 2, as.numeric)
  test_predictors <- apply(test_predictors, 2, as.numeric)
  train_predictors <- scale(train_predictors)
  test_predictors <- scale(test_predictors)
  train_predictors <- as.matrix(train_predictors)
  test_predictors <- as.matrix(test_predictors)

  # Remove infinite values
  train_predictors[!is.finite(train_predictors)] <- 0
  test_predictors[!is.finite(test_predictors)] <- 0

    for (i in 1:numks) {
      classifications <- knn(train_predictors, test_predictors, response_train, prob = TRUE, k = i)
      CM_Select <- confusionMatrix(table(classifications, response_test), positive = "Yes")
      masterAcc[j, i] <- CM_Select$overall[1]
    }
}
CM_Select
## Confusion Matrix and Statistics
## 
##                response_test
## classifications  No Yes
##             No  144  24
##             Yes   4   2
##                                           
##                Accuracy : 0.8391          
##                  95% CI : (0.7759, 0.8903)
##     No Information Rate : 0.8506          
##     P-Value [Acc > NIR] : 0.7085939       
##                                           
##                   Kappa : 0.0731          
##                                           
##  Mcnemar's Test P-Value : 0.0003298       
##                                           
##             Sensitivity : 0.07692         
##             Specificity : 0.97297         
##          Pos Pred Value : 0.33333         
##          Neg Pred Value : 0.85714         
##              Prevalence : 0.14943         
##          Detection Rate : 0.01149         
##    Detection Prevalence : 0.03448         
##       Balanced Accuracy : 0.52495         
##                                           
##        'Positive' Class : Yes             
## 
MeanAcc = colMeans(masterAcc); MeanAcc
##  [1] 0.7545977 0.7523563 0.8077586 0.8095977 0.8272414 0.8263793 0.8315517
##  [8] 0.8305172 0.8338506 0.8359195
plot(seq(1, numks, 1), MeanAcc, type = "l", ylab = "Mean Accuracy (Positive Class: Yes)")

which.max(MeanAcc)
## [1] 10
max(MeanAcc)
## [1] 0.8359195

We see that the best k is k = 10. The confusion matrix shown is from the final iteration, not an average across all k's tried. Here, the accuracy is 83.91%, similar to the initial confusion matrix for all predictors using KNN. The sensitivity is quite low at 7.69%, meaning the model struggles to correctly identify positive (true attrition) instances. Specificity is high (97.30%), indicating a solid ability to correctly identify negative instances (no attrition). The positive predictive value (precision) is 33.33%: among instances predicted as positive, about one third are true positives.

TRYING k = 10 for KNN - Select predictors

iterations<- 100
accuracy_table <- numeric(iterations)
trainIndices <- sample(1:dim(Attritiondata)[1], round(splitPerc * dim(Attritiondata)[1]))

  train <- as.data.frame(Attritiondata[trainIndices, ])
  test <- as.data.frame(Attritiondata[-trainIndices, ])

  train_features<- train[, selected_cols]
  test_features<- test[, selected_cols]
  
  train_target<- train$Attrition

for (i in 1:iterations) {
  classifications <- knn(train_features, test_features, train_target, prob = TRUE, k = 10)
  CM_SelectK <- confusionMatrix(table(as.factor(test$Attrition),classifications), positive ="Yes")
  accuracy_table[i] <- CM_SelectK$overall[1]
}

#print(accuracy_table)
avg_accuracy<-mean(accuracy_table); avg_accuracy
## [1] 0.8275862
specificity_table<- numeric(iterations)
sensitivity_table<- numeric(iterations)
for (i in 1:iterations) {
  classifications <- knn(train_features, test_features, train_target, prob = TRUE, k = 10)
  CM_SelectK<- confusionMatrix(table(as.factor(test$Attrition),classifications), positive ="Yes")
  specificity_table[i] <- CM_SelectK$byClass['Specificity']
  sensitivity_table[i] <- CM_SelectK$byClass['Sensitivity']
}
CM_SelectK
## Confusion Matrix and Statistics
## 
##      classifications
##        No Yes
##   No  145   0
##   Yes  29   0
##                                           
##                Accuracy : 0.8333          
##                  95% CI : (0.7695, 0.8854)
##     No Information Rate : 1               
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 1.999e-07       
##                                           
##             Sensitivity :     NA          
##             Specificity : 0.8333          
##          Pos Pred Value :     NA          
##          Neg Pred Value :     NA          
##              Prevalence : 0.0000          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.1667          
##       Balanced Accuracy :     NA          
##                                           
##        'Positive' Class : Yes             
## 
avg_specificity<-mean(specificity_table)
avg_specificity
## [1] 0.8323699
avg_sensitivity<-mean(sensitivity_table)
avg_sensitivity
## [1] 0

We see the average specificity is 83.24% and sensitivity is 0%. This model's performance is notably poor: with zero sensitivity, it failed to correctly identify any positive instances. The specificity and negative predictive value are relatively high, but the complete lack of sensitivity is a serious limitation. This suggests the model needs further refinement, or a different approach to address the class imbalance, before it can identify positive instances at all.
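One common way to attack that imbalance is to rebalance the training split before fitting, for example by up-sampling the minority class (caret's `upSample()` is a packaged version of the same idea). A minimal base-R sketch on a toy frame; the data and counts here are illustrative, not the real training split:

```r
set.seed(7)
# Toy imbalanced training frame: 90 "No" vs 10 "Yes"
train_toy <- data.frame(Attrition = rep(c("No", "Yes"), c(90, 10)),
                        Age = c(rnorm(90, 40, 8), rnorm(10, 31, 6)))

# Resample the minority class with replacement until the classes are even
minority <- train_toy[train_toy$Attrition == "Yes", ]
n_extra  <- sum(train_toy$Attrition == "No") - nrow(minority)
extra    <- minority[sample(nrow(minority), n_extra, replace = TRUE), ]
balanced <- rbind(train_toy, extra)

table(balanced$Attrition)  # both classes now have 90 rows
```

Fitting knn on `balanced` instead of `train_toy` would push the classifier toward predicting "Yes" more often, trading some specificity for nonzero sensitivity.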

Naive Bayes and Confusion Matrix - Training with All predictors

set.seed(1234)
iterations = 100

masterAcc = matrix(nrow = iterations)

splitPerc = .8 #Training / Test split Percentage

for(j in 1:iterations)
{
  trainIndices <- sample(1:dim(Attritiondata)[1], round(splitPerc * dim(Attritiondata)[1]))
  train <- as.data.frame(Attritiondata[trainIndices, ])
  test <- as.data.frame(Attritiondata[-trainIndices, ])
  
   train$Attrition <- factor(train$Attrition, levels = c("Yes", "No"))
   test$Attrition <- factor(test$Attrition, levels = c("Yes", "No"))
  
   model <- naiveBayes(train[, -3], as.factor(train$Attrition), laplace = 1)  # column 3 is Attrition
   predictions <- predict(model, test[, -3])
   confMatrix <- table(predictions, as.factor(test$Attrition))
   CM_NB_All <- confusionMatrix(confMatrix)
   masterAcc[j] <- CM_NB_All$overall[1]
}
CM_NB_All
## Confusion Matrix and Statistics
## 
##            
## predictions Yes No
##         Yes  23 92
##         No    2 57
##                                           
##                Accuracy : 0.4598          
##                  95% CI : (0.3841, 0.5368)
##     No Information Rate : 0.8563          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1211          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9200          
##             Specificity : 0.3826          
##          Pos Pred Value : 0.2000          
##          Neg Pred Value : 0.9661          
##              Prevalence : 0.1437          
##          Detection Rate : 0.1322          
##    Detection Prevalence : 0.6609          
##       Balanced Accuracy : 0.6513          
##                                           
##        'Positive' Class : Yes             
## 
MeanAcc = colMeans(masterAcc); MeanAcc
## [1] 0.6482184

Given all the predictors, the model's sensitivity rate is the percentage of actual attrition cases correctly identified; it measures the model's ability to capture employees who are truly at risk of attrition. This model shows imbalanced performance: high sensitivity (92.00%) but low specificity (38.26%). It identifies most actual attrition cases but misclassifies many employees who did not attrite. The positive predictive value is low, meaning a positive prediction has only a 20% chance of being correct.

Naive Bayes and Confusion Matrix - Training; Using Select predictors to determine Attrition

set.seed(1234)
iterations = 100

masterAcc = matrix(nrow = iterations)

splitPerc = .8 #Training / Test split Percentage

for(j in 1:iterations)
{
  trainIndices <- sample(1:dim(Attritiondata)[1], round(splitPerc * dim(Attritiondata)[1]))
  train <- as.data.frame(Attritiondata[trainIndices, ])
  test <- as.data.frame(Attritiondata[-trainIndices, ])
 
   train$Attrition <- factor(train$Attrition, levels = c("Yes", "No"))
   test$Attrition <- factor(test$Attrition, levels = c("Yes", "No"))
  
  model2 <- naiveBayes(train[, c(2, 6, 7, 8, 13, 16, 18, 20, 24, 25, 26)], as.factor(train$Attrition), laplace = 1)
  predictions <- predict(model2, test[, c(2, 6, 7, 8, 13, 16, 18, 20, 24, 25, 26)])
   confMatrix <- table(predictions, as.factor(test$Attrition))
   CM_NB_Select <- confusionMatrix(confMatrix)
   masterAcc[j] <- CM_NB_Select$overall[1]
}
CM_NB_Select
## Confusion Matrix and Statistics
## 
##            
## predictions Yes  No
##         Yes   3   4
##         No   22 145
##                                        
##                Accuracy : 0.8506       
##                  95% CI : (0.7888, 0.9)
##     No Information Rate : 0.8563       
##     P-Value [Acc > NIR] : 0.6357048    
##                                        
##                   Kappa : 0.133        
##                                        
##  Mcnemar's Test P-Value : 0.0008561    
##                                        
##             Sensitivity : 0.12000      
##             Specificity : 0.97315      
##          Pos Pred Value : 0.42857      
##          Neg Pred Value : 0.86826      
##              Prevalence : 0.14368      
##          Detection Rate : 0.01724      
##    Detection Prevalence : 0.04023      
##       Balanced Accuracy : 0.54658      
##                                        
##        'Positive' Class : Yes          
## 
MeanAcc = colMeans(masterAcc)

MeanAcc
## [1] 0.8312069
#Use the same model to predict on the competition test set. The indices after Attrition's
#position shift down by one because AttritionTest has no Attrition column.
Predictions<- predict(model2,AttritionTest[,c(2, 5, 6, 7, 12, 15, 17, 19, 23, 24, 25)])

AttritionTest$Attrition<- Predictions
#View(AttritionTest$Attrition)

AttritionPredictionsNB<- AttritionTest %>% select(c("ID","Attrition"))
write.csv(AttritionPredictionsNB,"CaseStudy2AttritionPredictionsNB_RenuKarthikeyan.csv", row.names = TRUE)

Sensitivity is 12%: only 12% of actual attrition cases are correctly identified, so the model may not be very effective at capturing employees who are truly at risk of attrition. This model has a higher accuracy of 85.06%, with imbalanced performance (low sensitivity, high specificity). However, accuracy can be misleading, especially in imbalanced datasets where one class ("No attrition") dominates; accuracy alone is not the best metric here. The positive predictive value (PPV) is 42.86%: when the model predicts attrition, there is a 42.86% chance the prediction is correct, reflecting its precision in identifying true positives among all predicted positives. The prevalence of attrition (the proportion of actual positive cases in the dataset) is low at 14.37%, which contributes to the imbalanced nature of the performance metrics.
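It helps to anchor these accuracies against a trivial baseline. Using the class counts reported earlier (730 "No" vs. 140 "Yes"), a model that always predicts "No" already achieves about 84% accuracy, which is why accuracy alone is a weak yardstick on this dataset:

```r
# Majority-class baseline from the reported class counts
n_no  <- 730
n_yes <- 140
baseline_acc <- n_no / (n_no + n_yes)
round(baseline_acc, 4)  # 0.8391
```

Any model whose accuracy hovers near 84% may simply be echoing the majority class, which is exactly what the sensitivity numbers above suggest.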

Best Model Determination from the Models tried for Attrition (KNN and Naive Bayes)

It looks like the KNN model at k = 7 with all predictors without threshold adjustment has better accuracy within the KNN models. The Naive Bayes model with select predictors has the highest accuracy, and is better than the other Naive Bayes model which included all predictors.

Naive Bayes has a higher accuracy (85.06%) than KNN (79.89%). However, accuracy alone may not be the most informative metric, especially in imbalanced datasets. KNN has a higher sensitivity (40.00%) than Naive Bayes (12.00%); sensitivity is crucial for identifying attrition cases, as it represents the proportion of true positives among all actual positives. Naive Bayes has higher specificity (97.32%) than KNN (81.07%); specificity matters for minimizing false positives, but must be balanced against sensitivity. Naive Bayes also has a higher positive predictive value (precision) at 42.86%, versus 5.88% for KNN.
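Collecting the figures reported above into one frame makes the trade-offs easier to scan (values are transcribed from the two confusion matrices, not recomputed here):

```r
# Side-by-side of the two finalist models, using the metrics reported above
comparison <- data.frame(
  Model       = c("KNN (k = 7, all predictors)", "Naive Bayes (select predictors)"),
  Accuracy    = c(0.7989, 0.8506),
  Sensitivity = c(0.4000, 0.1200),
  Specificity = c(0.8107, 0.9732),
  Precision   = c(0.0588, 0.4286)
)
comparison
```

Read across a row to see each model's profile: KNN wins on sensitivity, Naive Bayes on everything else.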

Of the two best models (the best KNN and the best Naive Bayes), I think Naive Bayes is the better model to predict attrition, given its higher positive predictive value (precision), higher accuracy, and narrower confidence interval.

Thank You

This concludes the presentation and analysis. Thank you for your time; I look forward to empowering Frito Lay with data-driven insight. I created a Shiny app to visualize and explore the attrition data; please feel free to follow the link provided. If you have any questions, feel free to reach out; my email is in the attached PowerPoint presentation. Thank you!